In this analysis, we will analyze Formula 1 data from the year 2,000 and onward. The reason for this time frame is because of two things: the first begin that there no active drivers that began their F1 careers before the year 2,000, and the second being that, beginning in the year 2,000, it was mandated that all racecars have the same engine size, which would change overtime.
Furthermore, our analysis dives into the relationship between a driver’s starting position and the likelihood of obtaining a certain finishing result. One would assume that the better one qualifies, the higher the change of winning the race. Also, it is known that there are a few big teams/drivers who have dominated the sport over the years and continue to do so, thus providing relatively consistent results. One can easily assume that certain teams are much more likely to win a Grand Prix thus those teams are more statistically significant in determining the result of a race. Because of this, we also investigate the relationship between the results/points given the constructors (teams) and drivers.
Lastly, we create several prediction models that allow for predicting a race results. We use ONLY the parameters that one would have available just before a race: the team name, the driver name, the era (engine size) and the starting position.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(caret)
library(class)Starting Position Vs Race Result
Unfortunately,
due to lack of programming knowledge, we were unable to create the plot
that we wanted. However, we found exactly what we needed online and thus
decided to include that plot with the source tagged just below it.
From the plot, one can see that there does in-fact appear to be a
relationship between a driver’s starting position. It is clear that the
higher one starts on the grid, the higher the probability of winning the
race. In addition these findings, we also noted that the lower one
starts on the grid, the higher the probability of not finishing the race
(DFN), which could make sense if one assumes that drivers towards the
back are driving much less conservatively and thus more aggressive than
those towards the front.
Source:
MrBookman_LibraryCop”(Reddit User)
Click
Here to Visit Source
Constructor/Driver Vs Race Result
From the results below, it is clear that Mercedes has the highest
number of top positions throughout Formula 1 history, which is without a
doubt the reason for their extreme success in F1 racing. Higher average
position leads to higher average points and ultimately the season
trophy. Remember, one does not have to win every race in order to win
the season trophy - it’s all about points.
This plot
below depicts the drivers’ count by race result. Given the results from
the previous plot, one would expect a Mercedes driver to have the
highest number of top-5 finishing positions. Sure and behold, the Kubica
appears to have a higher number of top-5 finishing positions in
comparison to the other drivers. Amon appears to be a more consistent
driver but has less top-5 results.
F1DataFrame = read.csv("ModelingData/F1ModelingCustomCategoricalVariables.csv")
head(F1DataFrame)## X driverId constructorId circuitId raceId year name.x resultId
## 1 1 1 1 1 18 2008 Australian Grand Prix 1
## 2 2 1 1 35 875 2012 Korean Grand Prix 21601
## 3 3 1 1 11 46 2007 Hungarian Grand Prix 589
## 4 4 1 131 22 1026 2019 Japanese Grand Prix 24528
## 5 5 1 1 7 344 2010 Canadian Grand Prix 20491
## 6 6 1 1 4 5 2009 Spanish Grand Prix 7642
## grid positionOrder name.y alt name forename
## 1 1 1 Albert Park Grand Prix Circuit 10 McLaren Lewis
## 2 3 10 Korean International Circuit 0 McLaren Lewis
## 3 1 1 Hungaroring 264 McLaren Lewis
## 4 4 3 Suzuka Circuit 45 Mercedes Lewis
## 5 1 1 Circuit Gilles Villeneuve 13 McLaren Lewis
## 6 14 9 Circuit de Barcelona-Catalunya 109 McLaren Lewis
## surname fullname era
## 1 Hamilton Lewis Hamilton 90 degrees V8
## 2 Hamilton Lewis Hamilton 90 degrees V8
## 3 Hamilton Lewis Hamilton 90 degrees V8
## 4 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 5 Hamilton Lewis Hamilton 90 degrees V8
## 6 Hamilton Lewis Hamilton 90 degrees V8
Logic to split data into training and testing sets.
modelVariables = c('positionOrder', 'grid', 'name', 'fullname', 'era')
modelDataframe = F1DataFrame[, modelVariables]
set.seed(4)
trainingIndices = sample(c(1:dim(modelDataframe)[1]), dim(modelDataframe)[1]*0.8)
trainingDataframe = modelDataframe[trainingIndices,]
testingDataframe = modelDataframe[-trainingIndices,]The Model Our initial model includes all of the data
from 1989 and onward without any data cleaning.
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data =
F1DataFrame)
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)Model Summary Statistics
summary(mymodel)##
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.3122 -4.6733 -0.9341 4.2487 21.6598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.904265 0.847114 18.775 < 2e-16 ***
## grid -0.028042 0.009928 -2.825 0.004742 **
## nameAlphaTauri -2.885055 0.792023 -3.643 0.000271 ***
## nameAlpine F1 Team -3.321502 0.888146 -3.740 0.000185 ***
## nameArrows 1.445664 0.696238 2.076 0.037876 *
## nameBAR -1.136120 0.717750 -1.583 0.113469
## nameBMW Sauber -5.609388 0.830871 -6.751 1.52e-11 ***
## nameCaterham 2.767392 0.822187 3.366 0.000765 ***
## nameFerrari -4.594213 0.572806 -8.021 1.14e-15 ***
## nameForce India -2.050777 0.668144 -3.069 0.002149 **
## nameHaas F1 Team 0.068101 0.744726 0.091 0.927141
## nameHonda -0.202469 0.887057 -0.228 0.819457
## nameHRT 5.360363 0.830318 6.456 1.11e-10 ***
## nameJaguar 0.270974 0.769901 0.352 0.724874
## nameJordan 0.652334 0.634822 1.028 0.304162
## nameLotus F1 -2.651931 0.804287 -3.297 0.000979 ***
## nameMarussia 2.957332 0.829335 3.566 0.000364 ***
## nameMcLaren -3.608018 0.574446 -6.281 3.47e-10 ***
## nameMercedes -5.287063 0.662341 -7.982 1.55e-15 ***
## nameMinardi 2.152997 0.629614 3.420 0.000629 ***
## nameOther 3.265153 0.568550 5.743 9.50e-09 ***
## nameRed Bull -4.788048 0.637921 -7.506 6.48e-14 ***
## nameRenault -2.495175 0.621994 -4.012 6.06e-05 ***
## nameSauber -0.573243 0.592537 -0.967 0.333342
## nameToro Rosso -0.559865 0.616570 -0.908 0.363877
## nameToyota -1.596507 0.700748 -2.278 0.022724 *
## nameWilliams -2.315616 0.572091 -4.048 5.20e-05 ***
## fullnameCarlos Sainz -2.963388 0.843671 -3.512 0.000445 ***
## fullnameDaniel Ricciardo -2.795338 0.796197 -3.511 0.000448 ***
## fullnameDavid Coulthard -1.324745 0.788190 -1.681 0.092835 .
## fullnameFelipe Massa -3.282596 0.761831 -4.309 1.65e-05 ***
## fullnameFernando Alonso -3.674028 0.742275 -4.950 7.52e-07 ***
## fullnameGiancarlo Fisichella -3.780827 0.759391 -4.979 6.48e-07 ***
## fullnameJarno Trulli -2.536071 0.777466 -3.262 0.001109 **
## fullnameJenson Button -3.723810 0.761335 -4.891 1.01e-06 ***
## fullnameKevin Magnussen -1.425999 0.928718 -1.535 0.124696
## fullnameKimi Räikkönen -3.642620 0.748826 -4.864 1.16e-06 ***
## fullnameLance Stroll -4.521595 0.889203 -5.085 3.72e-07 ***
## fullnameLewis Hamilton -6.394946 0.799229 -8.001 1.33e-15 ***
## fullnameMark Webber -2.428414 0.806392 -3.011 0.002605 **
## fullnameMax Verstappen -5.027525 0.882818 -5.695 1.26e-08 ***
## fullnameMichael Schumacher -4.954514 0.762672 -6.496 8.51e-11 ***
## fullnameNick Heidfeld -1.755088 0.839081 -2.092 0.036485 *
## fullnameNico Hülkenberg -2.088303 0.767593 -2.721 0.006525 **
## fullnameNico Rosberg -3.314610 0.841876 -3.937 8.28e-05 ***
## fullnameOther -0.798539 0.647613 -1.233 0.217578
## fullnameRalf Schumacher -3.099923 0.828450 -3.742 0.000183 ***
## fullnameRomain Grosjean -1.305760 0.919098 -1.421 0.155427
## fullnameRubens Barrichello -3.498207 0.754031 -4.639 3.53e-06 ***
## fullnameSebastian Vettel -5.476040 0.762718 -7.180 7.34e-13 ***
## fullnameSergio Pérez -4.412941 0.732878 -6.021 1.77e-09 ***
## fullnameValtteri Bottas -4.581384 0.831545 -5.509 3.66e-08 ***
## era90 degrees V8 0.187351 0.193886 0.966 0.333913
## eraUp to 12 cylinders -1.030652 0.224903 -4.583 4.63e-06 ***
## eraV10 -1.714270 0.247002 -6.940 4.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.475 on 13975 degrees of freedom
## Multiple R-squared: 0.2512, Adjusted R-squared: 0.2483
## F-statistic: 86.82 on 54 and 13975 DF, p-value: < 2.2e-16
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
ggplot(data = F1DataFrame, aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="starting vs finishing position by era",
y="finishing position",
x="starting position")
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)F1DataFrame = F1DataFrame %>% filter(F1DataFrame$year >= "2000")
mymodel = lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
simpleModel = lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)Model 2 Summary Statistics
summary(mymodel)##
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era,
## data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.305 -3.527 -1.062 2.713 19.492
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.18992 0.68222 14.936 < 2e-16 ***
## grid 0.35685 0.01675 21.306 < 2e-16 ***
## nameAlphaTauri -0.96330 0.60783 -1.585 0.113043
## nameAlpine F1 Team -2.01214 0.68050 -2.957 0.003116 **
## nameArrows 1.68047 0.70947 2.369 0.017875 *
## nameBAR -0.18805 0.58039 -0.324 0.745948
## nameBMW Sauber -3.24381 0.63884 -5.078 3.90e-07 ***
## nameCaterham 1.19126 0.63429 1.878 0.060402 .
## nameFerrari -3.45016 0.45662 -7.556 4.56e-14 ***
## nameForce India -1.61940 0.51491 -3.145 0.001666 **
## nameHaas F1 Team 0.37112 0.56909 0.652 0.514330
## nameHonda -0.41850 0.69192 -0.605 0.545302
## nameHRT 2.58153 0.65514 3.940 8.20e-05 ***
## nameJaguar 0.77193 0.60813 1.269 0.204344
## nameJordan 1.19768 0.58538 2.046 0.040788 *
## nameLotus F1 -1.64706 0.61482 -2.679 0.007399 **
## nameMarussia 1.13948 0.64156 1.776 0.075747 .
## nameMcLaren -1.58150 0.45005 -3.514 0.000443 ***
## nameMercedes -2.96328 0.52171 -5.680 1.39e-08 ***
## nameMinardi 1.17922 0.62166 1.897 0.057874 .
## nameOther 0.32732 0.46878 0.698 0.485051
## nameRed Bull -2.67785 0.49441 -5.416 6.24e-08 ***
## nameRenault -1.48690 0.47883 -3.105 0.001907 **
## nameSauber -0.28114 0.47039 -0.598 0.550072
## nameToro Rosso -0.30402 0.47383 -0.642 0.521137
## nameToyota -0.71284 0.54921 -1.298 0.194339
## nameWilliams -0.19097 0.44755 -0.427 0.669601
## fullnameCarlos Sainz -1.92574 0.64850 -2.970 0.002990 **
## fullnameDaniel Ricciardo -2.15690 0.61099 -3.530 0.000417 ***
## fullnameDavid Coulthard -1.13276 0.66200 -1.711 0.087092 .
## fullnameFelipe Massa -2.00781 0.59393 -3.381 0.000726 ***
## fullnameFernando Alonso -2.56675 0.57460 -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949 0.60986 -3.147 0.001653 **
## fullnameJarno Trulli -0.71872 0.62250 -1.155 0.248299
## fullnameJenson Button -2.81085 0.58919 -4.771 1.86e-06 ***
## fullnameKevin Magnussen -1.31376 0.71091 -1.848 0.064634 .
## fullnameKimi Räikkönen -2.05271 0.58146 -3.530 0.000417 ***
## fullnameLance Stroll -2.88345 0.68776 -4.192 2.79e-05 ***
## fullnameLewis Hamilton -4.29182 0.62320 -6.887 6.09e-12 ***
## fullnameMark Webber -1.49148 0.62177 -2.399 0.016470 *
## fullnameMax Verstappen -3.24201 0.67972 -4.770 1.87e-06 ***
## fullnameMichael Schumacher -1.76263 0.66470 -2.652 0.008021 **
## fullnameNick Heidfeld -1.50224 0.64358 -2.334 0.019606 *
## fullnameNico Hülkenberg -1.05863 0.58787 -1.801 0.071766 .
## fullnameNico Rosberg -2.43295 0.65492 -3.715 0.000205 ***
## fullnameOther -1.04727 0.49630 -2.110 0.034871 *
## fullnameRalf Schumacher -2.27859 0.68996 -3.302 0.000962 ***
## fullnameRomain Grosjean -0.69624 0.70355 -0.990 0.322388
## fullnameRubens Barrichello -1.97576 0.63877 -3.093 0.001987 **
## fullnameSebastian Vettel -2.94648 0.58946 -4.999 5.88e-07 ***
## fullnameSergio Pérez -2.77168 0.56127 -4.938 8.02e-07 ***
## fullnameValtteri Bottas -3.38177 0.64398 -5.251 1.54e-07 ***
## era90 degrees V8 -0.19567 0.28781 -0.680 0.496616
## eraV10 -0.45452 0.35751 -1.271 0.203627
## grid:era90 degrees V8 0.03029 0.02157 1.404 0.160292
## grid:eraV10 -0.08097 0.02809 -2.882 0.003959 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared: 0.367, Adjusted R-squared: 0.3632
## F-statistic: 95.97 on 55 and 9104 DF, p-value: < 2.2e-16
summary(simpleModel)##
## Call:
## lm(formula = positionOrder ~ grid, data = F1DataFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.289 -3.824 -1.086 3.015 18.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.070542 0.109514 46.30 <2e-16 ***
## grid 0.550774 0.008669 63.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.157 on 9158 degrees of freedom
## Multiple R-squared: 0.3059, Adjusted R-squared: 0.3058
## F-statistic: 4036 on 1 and 9158 DF, p-value: < 2.2e-16
V10plot = F1DataFrame %>% filter(F1DataFrame$era == "V10") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="Starting vs Finishing Position for V10 Era",
y="Finishing Position",
x="Starting Position") + theme(legend.position = "none")
V8plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V8") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="Starting vs Finishing Position for V8 Era",
y="Finishing Position",
x="Starting Position") + theme(legend.position = "none")
V6plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V6 + MGUs") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) +
geom_point(position = "jitter", size = .8) +
labs(title="Starting vs Finishing Position for V6 Era",
y="Finishing Position",
x="Starting Position") + theme(legend.position = "none")
V10plotV8plotV6plotset.seed(1234)
fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
knn.fit<-train(positionOrder~.,
data=modelDataframe,
method="knn",
trControl=fitControl,
tuneGrid=expand.grid(k=c(1:10,20,30))
)
knn.fit## k-Nearest Neighbors
##
## 14030 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 12626, 12629, 12627, 12626, 12627, 12627, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared MAE
## 1 5.895825 0.3962190 4.631315
## 2 5.738827 0.4196324 4.542079
## 3 5.666648 0.4309476 4.496665
## 4 5.614551 0.4395024 4.466336
## 5 5.583519 0.4447475 4.447944
## 6 5.566017 0.4475374 4.441407
## 7 5.561753 0.4481718 4.440443
## 8 5.551744 0.4499118 4.433137
## 9 5.543840 0.4512980 4.431609
## 10 5.538499 0.4521379 4.431004
## 20 5.519373 0.4552330 4.429668
## 30 5.528506 0.4537695 4.440500
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 20.
THANK YOU!